Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: do ambiguous_distinct_check in select #14180

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

lichuang
Copy link

@lichuang lichuang commented Jan 18, 2025

Which issue does this PR close?

Closes #10326.

refactor: do ambiguous_distinct_check in select

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

… of traversing the plan looking for projection node
@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jan 18, 2025
@lichuang lichuang changed the title fix: add missing columns into list directory fix: add missing columns into list directly Jan 18, 2025
@lichuang
Copy link
Author

lichuang commented Jan 22, 2025

@jonahgao in #10234 (comment) comment:

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node.

If not traversing the plan looking for projection node, how to do ambiguous_distinct_check check? This check only be called when handling projection node in distinct, case test_distinct_on_sort_by_unprojected test this case.

@jonahgao
Copy link
Member

@jonahgao in #10234 (comment) comment:

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node.

If not traversing the plan looking for projection node, how to do ambiguous_distinct_check check? This check only be called when handling projection node in distinct, case test_distinct_on_sort_by_unprojected test this case.

My thought is to handle it in select_to_plan, where the original select list and the distinct flag are both available.

@github-actions github-actions bot added the sql SQL Planner label Jan 23, 2025
@lichuang
Copy link
Author

lichuang commented Jan 25, 2025

@jonahgao in #10234 (comment) comment:

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node.

If not traversing the plan looking for projection node, how to do ambiguous_distinct_check check? This check only be called when handling projection node in distinct, case test_distinct_on_sort_by_unprojected test this case.

My thought is to handle it in select_to_plan, where the original select list and the distinct flag are both available.

@jonahgao select_to_plan only works with SQL API, but sometimes people use DataFrame API directly, where test_distinct_sort_by_unprojected is this case, so only check in select_to_plan not works for DataFrame API.

@jonahgao
Copy link
Member

@jonahgao select_to_plan only works with SQL API, but sometimes people use DataFrame API directly, where test_distinct_sort_by_unprojected is this case, so only check in select_to_plan not works for DataFrame API.

My plan is for the DataFrame to keep using add_missing_columns, and for the SQL API to reuse the processing logic of HAVING. This can make the planning of SQL ORDER BY faster and more accurate.

SQL has well-known specs for handling ORDER BY. For example, ORDER BY can only reference columns from the SELECT list and the table expression, and there are limitations imposed by the GROUP BY clause. I haven’t found similar specs for sorting DataFrames, and I think they are different concepts.

@lichuang lichuang marked this pull request as ready for review February 3, 2025 13:08
@lichuang lichuang changed the title fix: add missing columns into list directly refactor: do ambiguous_distinct_check in select Feb 3, 2025
@lichuang
Copy link
Author

lichuang commented Feb 5, 2025

@jonahgao please help me review this pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unify SQL planning for ORDER BY, HAVING, DISTINCT, etc
2 participants